28 July 2017
## Might be a while...
install.packages(c("ggplot2","dplyr","tidyr","janitor","plotly",
"devtools","learnr","gapminder"))
library(devtools)
install_github("kevinwang09/2017_STAT3914", subdir = "learnr3914")
iris dataset. Typing iris into R console should load this data. Pay attention to its column, row names and structure of each column.Statisticians are great at many things:
But the mother of all these, i.e. preparing data is not trivial. (e.g. STAT2xxx lab exams)
dirtyIris data to be exactly the same as the original iris data.readr and readxljanitormagrittrdplyrggplot2base R functions are not sufficient for modern uses.
readr functions are superior in data import warnings, column type handling, speed, scalability and consistency.
library(readr)
dirtyIris = readr::read_csv("dirtyIris.csv")
Parsed with column specification: cols( SepAl....LeNgth = col_double(), `Sepal.? Width` = col_double(), `petal.Length(*&^` = col_double(), `petal.$#^&Width` = col_double(), `SPECIES^` = col_character(), allEmpty = col_character() )
class(dirtyIris) ## `tibble` is a `data.frame` with better formatting.
[1] "tbl_df" "tbl" "data.frame"
readxl and haven (for SAS, SPSS etc) packages work similarly.dirtyIris
# A tibble: 250 x 6
SepAl....LeNgth `Sepal.? Width` `petal.Length(*&^` `petal.$#^&Width`
<dbl> <dbl> <dbl> <dbl>
1 5.80000000 2.6 4.0 1.2
2 5.80000000 2.7 5.1 1.9
3 0.01874617 NA NA NA
4 NA NA NA NA
5 NA NA NA NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
# allEmpty <chr>
dirtyIris dataset.Here is a dataset. Click here.
.gmt file and who publishes this format?.gmt files?R? Is it a data.frame?Clean data is a data set that allows you to do statistical modelling without extra processing
Cricket_australia vs cricket.Australia)Clean data is a well-designed data.frame.
Column type (esp. dates and factors) handling was the primary reason we used readr instead of base R when importing data.
Our goal: clean the dirtyIris data to be exactly the same as the original iris data.
janitor package.dplyr.library(janitor) glimpse(dirtyIris)
Observations: 250 Variables: 6 $ SepAl....LeNgth <dbl> 5.80000000, 5.80000000, 0.01874617, NA, NA, 6... $ Sepal.? Width <dbl> 2.6, 2.7, NA, NA, NA, 3.0, NA, 2.9, NA, NA, 3... $ petal.Length(*&^ <dbl> 4.0, 5.1, NA, NA, NA, 5.5, NA, 5.6, NA, NA, 1... $ petal.$#^&Width <dbl> 1.2, 1.9, NA, NA, NA, 1.8, NA, 1.8, NA, NA, 0... $ SPECIES^ <chr> "versicolor", "virginica", NA, NA, NA, "virgi... $ allEmpty <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## Clean up column names better = clean_names(dirtyIris) glimpse(better)
Observations: 250 Variables: 6 $ sepal_length <dbl> 5.80000000, 5.80000000, 0.01874617, NA, NA, 6.500... $ sepal_width <dbl> 2.6, 2.7, NA, NA, NA, 3.0, NA, 2.9, NA, NA, 3.8, ... $ petal_length <dbl> 4.0, 5.1, NA, NA, NA, 5.5, NA, 5.6, NA, NA, 1.9, ... $ petal_width <dbl> 1.2, 1.9, NA, NA, NA, 1.8, NA, 1.8, NA, NA, 0.4, ... $ species <chr> "versicolor", "virginica", NA, NA, NA, "virginica... $ allempty <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## Removing empty rows/columns evenBetter = remove_empty_rows(better) evenBetter = remove_empty_cols(evenBetter) glimpse(evenBetter)
Observations: 209 Variables: 5 $ sepal_length <dbl> 5.80000000, 5.80000000, 0.01874617, 6.50000000, 6... $ sepal_width <dbl> 2.6, 2.7, NA, 3.0, 2.9, NA, 3.8, 2.8, NA, 3.0, 3.... $ petal_length <dbl> 4.0, 5.1, NA, 5.5, 5.6, NA, 1.9, 4.8, NA, 4.2, 5.... $ petal_width <dbl> 1.2, 1.9, NA, 1.8, 1.8, NA, 0.4, 1.8, NA, 1.5, 2.... $ species <chr> "versicolor", "virginica", NA, "virginica", "virg...
Genuinely missing values should be retained, but in this case, the NA's were added.
Removing any rows with NA
evenBetterBetter = remove_missing(evenBetter) cleanIris = evenBetterBetter
glimpse(cleanIris)
Observations: 150 Variables: 5 $ sepal_length <dbl> 5.8, 5.8, 6.5, 6.3, 5.1, 6.2, 5.9, 6.8, 5.0, 6.7,... $ sepal_width <dbl> 2.6, 2.7, 3.0, 2.9, 3.8, 2.8, 3.0, 3.2, 3.0, 3.0,... $ petal_length <dbl> 4.0, 5.1, 5.5, 5.6, 1.9, 4.8, 4.2, 5.9, 1.6, 5.0,... $ petal_width <dbl> 1.2, 1.9, 1.8, 1.8, 0.4, 1.8, 1.5, 2.3, 0.2, 1.7,... $ species <chr> "versicolor", "virginica", "virginica", "virginic...
glimpse(iris)
Observations: 150 Variables: 5 $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,... $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,... $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,... $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,... $ Species <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...
mean(cleanIris$sepal_length)
[1] 5.843333
plot(density(cleanIris$sepal_length), col = "red", lwd = 2)
We introduce a new notation: " x %>% f " means "f(x)". We call this operation as "x pipe f".
Compounded operations are possible. Keyboard shortcut is Cmd+shift+M.
cleanIris$sepal_length %>% mean
[1] 5.843333
cleanIris$sepal_length %>% density %>% plot(col = "red", lwd = 2)
dplyr: data subsetting mastersepal_length and sepal_width in the cleanIris data:## Assuming you know the position of column names.
## But what if you resample your data?
cleanIris[1:2, c(1, 2)]
## Assuming you know the position of column names.
## Also assuming the first two columns satisfy certain properties.
cleanIris[1:2, c(T, T, F, F, F)]
## Much better!
## What if you can't type out all the column names
## due to the size of your data?
cleanIris[1:2, c("sepal_length", "sepal_width")]
cleanIris[(cleanIris[,"sepal_length"] < 5) &
(cleanIris[,"sepal_width"] < 3), c("petal_length", "sepal_length")]
# A tibble: 4 x 2
petal_length sepal_length
<dbl> <dbl>
1 1.3 4.5
2 1.4 4.4
3 3.3 4.9
4 4.5 4.9
R user might know about the subset function, but it suffers the same problem of not able to have multiple subsetting criteria without predefined variables.subset(cleanIris,
subset = (sepal_length < 5) & (sepal_width < 3),
select = grep("length", colnames(cleanIris), value = TRUE))
# A tibble: 4 x 2
sepal_length petal_length
<dbl> <dbl>
1 4.5 1.3
2 4.4 1.4
3 4.9 3.3
4 4.9 4.5
select columns are operations on variables, andfilter rows are operations on observations
See dplyr cheatsheet.
library(dplyr)
cleanIris %>%
filter(sepal_length < 5,
sepal_width < 3) %>%
select(contains("length"))
# A tibble: 4 x 2
sepal_length petal_length
<dbl> <dbl>
1 4.5 1.3
2 4.4 1.4
3 4.9 3.3
4 4.9 4.5
mutate creates new columnsiris_mutated = mutate(cleanIris,
newVar = sepal_length + petal_length,
newNewVar = newVar*2)
iris_mutated
# A tibble: 150 x 7
sepal_length sepal_width petal_length petal_width species newVar
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 5.8 2.6 4.0 1.2 versicolor 9.8
2 5.8 2.7 5.1 1.9 virginica 10.9
3 6.5 3.0 5.5 1.8 virginica 12.0
4 6.3 2.9 5.6 1.8 virginica 11.9
5 5.1 3.8 1.9 0.4 setosa 7.0
# ... with 145 more rows, and 1 more variables: newNewVar <dbl>
group_by + summarise will create summary statistics for grouped variablesbySpecies = cleanIris %>% group_by(species) bySpecies
# A tibble: 150 x 5
# Groups: species [3]
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.8 2.6 4.0 1.2 versicolor
2 5.8 2.7 5.1 1.9 virginica
3 6.5 3.0 5.5 1.8 virginica
4 6.3 2.9 5.6 1.8 virginica
5 5.1 3.8 1.9 0.4 setosa
# ... with 145 more rows
bySpecies %>% summarise(meanSepalLength = mean(sepal_length))
# A tibble: 3 x 2
species meanSepalLength
<chr> <dbl>
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
left_join for merging dataflowers = data.frame(species = c("setosa", "versicolor", "virginica"),
comments = c("meh", "kinda_okay", "love_it!"))
## cleanIris has the priority in this join operation
iris_comments = left_join(cleanIris, flowers, by = "species")
## Randomly sampling 6 rows
sample_n(iris_comments, 6)
# A tibble: 6 x 6
sepal_length sepal_width petal_length petal_width species comments
<dbl> <dbl> <dbl> <dbl> <chr> <fctr>
1 5.8 2.6 4.0 1.2 versicolor kinda_okay
2 6.4 2.8 5.6 2.2 virginica love_it!
3 6.1 3.0 4.6 1.4 versicolor kinda_okay
4 4.9 2.4 3.3 1.0 versicolor kinda_okay
5 5.5 3.5 1.3 0.2 setosa meh
6 6.0 2.2 4.0 1.0 versicolor kinda_okay
arrange for ordering rowsarrangeCleanIris = cleanIris %>% arrange(sepal_length, sepal_width, petal_length, petal_width) ## The true iris data arrangeIris = iris %>% clean_names() %>% arrange(sepal_length, sepal_width, petal_length, petal_width)
dirtyIris data and the arranged iris data.## The `Species` column is character or factor all.equal(arrangeCleanIris, arrangeIris)
[1] "Incompatible type for column `species`: x character, y factor"
arrangeIris = arrangeIris %>% mutate(species = as.character(species)) ## Great! all.equal(arrangeCleanIris, arrangeIris)
[1] TRUE
## identical is stronger version of all.equal ## arrangeCleanIris is a tibble ## but arrangeIris is a data.frame identical(arrangeCleanIris, arrangeIris)
[1] FALSE
class(arrangeCleanIris)
[1] "tbl_df" "tbl" "data.frame"
class(arrangeIris)
[1] "data.frame"
## Now, they are exactly the same! identical(arrangeCleanIris, arrangeIris %>% as.tibble)
[1] TRUE
cleanIris %>%
select(ends_with("length"))
# A tibble: 150 x 2
sepal_length petal_length
<dbl> <dbl>
1 5.8 4.0
2 5.8 5.1
3 6.5 5.5
4 6.3 5.6
5 5.1 1.9
# ... with 145 more rows
cleanIris %>%
select(starts_with("sepal"))
# A tibble: 150 x 2
sepal_length sepal_width
<dbl> <dbl>
1 5.8 2.6
2 5.8 2.7
3 6.5 3.0
4 6.3 2.9
5 5.1 3.8
# ... with 145 more rows
bySpecies %>%
summarise_if(is.numeric,
funs(m = mean))
# A tibble: 3 x 5
species sepal_length_m sepal_width_m petal_length_m petal_width_m
<chr> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
ggplot2: the best visualisation packagelibrary(learnr3914) learnggplot2()
Otherwise, please download and compile the "ggplot2_basic_tutorial.Rmd" here
If all fails, please send me an email at kevin.wang@sydney.edu.au.
ggplot2 is that, its design will force you to use a certain grammar when producing a plot. So much so, every plot you produce is actually a statistic.library(ggplot2)
p1 =
iris %>%
ggplot(aes(x = Sepal.Length,
y = Sepal.Width,
colour = Species)) +
geom_point(size = 3)
p1
tidyverse is a collection of 20+ packages built on the philosophy of being organised for the purpose of collaboration.
plotly::ggplotly(p1)
gapminder. http://edinbr.org/edinbr/2016/05/11/may-Hadley-Update2-PostingTalk.htmllibrary(gapminder)
library(dplyr)
library(purrr)
library(tidyr)
library(broom)
library(ggplot2)
gapminder <- gapminder %>% mutate(year1950 = year - 1950)
by_country <- gapminder %>%
group_by(continent, country) %>%
nest()
country_model <- function(df) {
lm(lifeExp ~ year1950, data = df)
}
models <- by_country %>%
mutate(
model = data %>% map(country_model)
)
models <- models %>%
mutate(
glance = model %>% map(broom::glance),
rsq = glance %>% map_dbl("r.squared"),
tidy = model %>% map (broom::tidy),
augment = model %>% map (broom::augment)
)
models %>%
unnest(tidy) %>%
select(continent, country, term, estimate, rsq) %>%
spread(term, estimate) %>%
ggplot(aes(`(Intercept)`, year1950)) +
geom_point(aes(colour = continent, size = rsq)) +
geom_smooth(se = FALSE) +
xlab("Life Expectancy in 1950") +
ylab("Linear yearly improvement") +
ggtitle("LM on life exp. explained by time elapsed from 1950") +
scale_size_area(expression(R^2)) +
theme(legend.position = "bottom")
Use RStudio + RMarkdown to document your codes.
Learn some computational tools. They are not statistics, but not learning them could inhibit your career aspects.
Find "cool" components and adapt those into your work routine. (Hint: start with all RStudio cheatsheets and build up gradually.)
Take time to re-analyse an old dataset.
Learn core functions and vignette.
Don't forget the theories and interpretations! This is a course about statistics after all, not Cranking-Out-Numbers-Less-Than-0.05-And-Reject-Null-Hypothesis-101.
sessionInfo()
R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.3 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib locale: [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] bindrcpp_0.2 janitor_0.3.0 plotly_4.7.0 dplyr_0.7.2 [5] purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 [9] ggplot2_2.2.1 tidyverse_1.1.1 knitr_1.16 loaded via a namespace (and not attached): [1] reshape2_1.4.2 haven_1.1.0 lattice_0.20-35 [4] colorspace_1.3-2 htmltools_0.3.6 viridisLite_0.2.0 [7] yaml_2.1.14 rlang_0.1.1 foreign_0.8-69 [10] glue_1.1.1 modelr_0.1.0 readxl_1.0.0 [13] bindr_0.1 plyr_1.8.4 stringr_1.2.0 [16] munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0 [19] rvest_0.3.2 htmlwidgets_0.9 psych_1.7.5 [22] evaluate_0.10.1 labeling_0.3 forcats_0.2.0 [25] httpuv_1.3.5 crosstalk_1.0.0 parallel_3.4.1 [28] broom_0.4.2 Rcpp_0.12.12 xtable_1.8-2 [31] scales_0.4.1.9002 backports_1.1.0 jsonlite_1.5 [34] mime_0.5 mnormt_1.5-5 hms_0.3 [37] digest_0.6.12 stringi_1.1.5 shiny_1.0.3 [40] grid_3.4.1 rprojroot_1.2 tools_3.4.1 [43] magrittr_1.5 lazyeval_0.2.0 pkgconfig_2.0.1 [46] data.table_1.10.4 xml2_1.1.1 lubridate_1.6.0 [49] assertthat_0.2.0 rmarkdown_1.6 httr_1.2.1 [52] R6_2.2.2 nlme_3.1-131 compiler_3.4.1